IGLoo: Profiling the Immunoglobulin Heavy chain locus in Lymphoblastoid Cell Lines with PacBio High-Fidelity Sequencing reads

New high-quality human genome assemblies derived from lymphoblastoid cell lines (LCLs) provide reference genomes and pangenomes for genomics studies. However, the characteristics of LCLs pose technical challenges to profiling immunoglobulin (IG) genes. IG loci in LCLs contain a mixture of germline and somatically recombined haplotypes, making them difficult to genotype or assemble accurately. To address these challenges, we introduce IGLoo, a software tool that implements novel methods for analyzing sequence data and genome assemblies derived from LCLs. IGLoo characterizes somatic V(D)J recombination events in the sequence data and identifies the breakpoints and missing IG genes in the LCL-based assemblies. Furthermore, IGLoo implements a novel reassembly framework to improve germline assembly quality by integrating information about somatic events and population structural variations in the IG loci. We applied IGLoo to study the assemblies from the Human Pangenome Reference Consortium, providing new insights into the mechanisms, gene usage, and patterns of V(D)J recombination, causes of assembly fragmentation in the IG heavy chain (IGH) locus, and improved representation of the IGH assemblies.


Figure S1: The distribution of the distances from the 3,357 split sites to RSSs. a. The histogram of the absolute distance (bp) to the closest RSS; the first bin shows the number of sites with a distance between 0 and 10, while the last bin shows the number with a distance over 1000. b. The distribution of the distance (bp) to the RSS, stratified by IGHJ, IGHV, and IGHD genes. We only show the cases with distance < 50 bp. The distances for J genes are measured from their 5′ ends, while the distances for V and D genes are measured from their 3′ ends.

Figure S14: The gene number in its relative position on chromosome 14 of the reference genomes T2T-CHM13, GRCh37, and GRCh38. All D and J genes are listed, but only functional or ORF V genes are listed for simplicity.

Figure S3: The bird's-eye view (top) and zoomed-in view (bottom) of the complete V(D)J recombination event using the pseudogene IGHV1-17.

Figure S4: The bird's-eye view (top) and zoomed-in view (bottom) of the complete V(D)J recombination event using the pseudogene IGHV3-41.

Figure S5: The only V-D-only partial recombination event we observed in HPRC, from individual HG03492. The blue arrows indicate the reads with a long deletion across the genes IGHD6-19 and IGHV1-24. Note that the 3′-end segments of the reads stretch to the position between the IGHD and IGHJ loci.

Figure S7: The bird's-eye view and the zoomed-in view of the individual HG02559 with an inverted J sequence event. The non-canonical recombination event involved an inversion between IGHJ2 and IGHJ4; the sequence between the two genes is inverted, as shown by the red and blue sequences.

Figure S8: The non-canonical recombination event that involved an inversion between IGHD6-13 and IGHD5-12. The sequence between the two genes, which is indicated by the blue arrow, is inverted. In the raw read, the head of the left-side red arrow connects to the tail of the blue arrow, and the blue arrowhead connects to the tail of the right-side red arrow. The IGHD6-13 gene is at the tail of the blue arrow, right next to the IGHD5-12 gene on the raw read.

Figure S9: The bird's-eye view and the zoomed-in view of the individual HG00621 with an inverted V sequence event. The non-canonical recombination event involved an inversion starting from the V gene IGHV5-51.

Figure S10: The IGV screenshot of the individual HG00741 with the double V-D recombination event. The segments of the non-canonical recombination event are marked with blue arrows. The event involved a V-D recombination between IGHD6-19 and IGHV3-19, and another V-D recombination event between IGHD4-23 and IGHV3-23.

Figure S12: The IGV screenshot of the read mapping on the IGHD loci of the sample NA19240 and the sample's maternal (NA19238) and paternal (NA19239) data.

Figure S13: Comparison of the number of IGH genes covered by IGenotyper (14) and IGLoo --ReAsm in the 47 HPRC samples. a. The differences in V, D, and J gene counts between the two methods in the IGH locus. A positive value means the IGLoo --ReAsm assemblies cover more IGH genes, while a negative value means the IGenotyper assemblies cover more genes. b. Comparison of the total number of IGH V, D, and J genes between the two methods.